Text processing and analysis were performed using the Natural Language Toolkit (NLTK, www.nltk.org). Custom Python (http://python.org) scripts were written to call NLTK functions. All of the visualizations were implemented using SDL and SDL_image (www.libsdl.org) and OpenGL. All of the work was performed at UNC Charlotte. Four Python scripts (49-170 lines each) were used in the text processing. NLTK involves a fairly steep learning curve. Our group had prior experience with SDL and OpenGL. The visualization application took perhaps a month of effort to put together.
Text Processing: The microblog file was processed using NLTK to produce a ranked, time-ordered blog. This involved (1) a keyword search for domain-relevant words such as 'flu' and 'sweats'; (2) custom Python scripts that produced 30 random concordances of the relevant words, which were visually inspected to build the grammar search scripts; (3) grammar extraction, which ranked the blog entries into 7 categories (1-7, with 1 the most relevant); and (4) a similar procedure that produced a ranked event file.
Visualization Tools: Our virus tracker application displays all output superimposed on the Vastopolis map, with the following major features: (1) all blogs are points on the map and can be 'played' over time (yellow dots in Fig. 1); (2) blogs can be filtered by rank; (3) blogs at an event are displayed separately; (4) areas with above-average numbers of sick people can be highlighted as rectangular regions, which clarifies trends; and (5) up to 3 blog filters on specific terms, with corresponding highlighting (used in identifying the waterborne outbreak).
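The keyword search and random concordance sampling described under Text Processing can be illustrated with a minimal sketch. It uses only the Python standard library as a stand-in for the NLTK calls, which are not shown in this report; the function names, the context width and the keyword list beyond 'flu' and 'sweats' are assumptions made for illustration.

```python
import random
import re

# Assumed keyword list; the report names only 'flu' and 'sweats'.
KEYWORDS = ("flu", "sweats")


def tokenize(text):
    """Lowercase a message and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_hits(messages, keywords=KEYWORDS):
    """Return (message_index, token_index, tokens) for each keyword occurrence."""
    wanted = set(keywords)
    hits = []
    for i, msg in enumerate(messages):
        tokens = tokenize(msg)
        for j, tok in enumerate(tokens):
            if tok in wanted:
                hits.append((i, j, tokens))
    return hits


def sample_concordances(messages, keywords=KEYWORDS, n=30, width=4, seed=0):
    """Draw up to n random keyword-in-context lines for visual inspection."""
    hits = keyword_hits(messages, keywords)
    rng = random.Random(seed)  # fixed seed so a sample can be reproduced
    chosen = rng.sample(hits, min(n, len(hits)))
    lines = []
    for _, j, tokens in chosen:
        left = " ".join(tokens[max(0, j - width):j])
        right = " ".join(tokens[j + 1:j + 1 + width])
        lines.append(f"{left} [{tokens[j]}] {right}".strip())
    return lines
```

Reading such context lines by eye is what would suggest the phrase patterns that separate a genuine symptom report from an incidental mention; the ranking patterns themselves are not given in this report and are not reproduced here.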
MC 1.1 Origin and Epidemic Spread: Identify approximately where the outbreak started on the map (ground zero location). If possible, outline the affected area. Explain how you arrived at your conclusion.
On the 19th we were able to detect the virus trending largely in two directions. We isolated these further by overlaying an adaptive grid, as seen in Fig. 5. The grid automatically computes an average number of data points for each sector based on the initial data set, and highlights sectors that are currently above their average according to a set threshold. The resolution of the grid can be adjusted; for instance, a lower-resolution grid will compute a more accurate average if the initial data set is not very dense. Using this grid, we clearly identified one trend along the interstates through the east side and another along the banks of the Vast River, moving southwest and originating close to the Downtown area.
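The adaptive grid can be sketched roughly as follows. This is an illustration only, not the SDL/OpenGL implementation used in the application: the function names are hypothetical, the baseline is assumed to be a list of time windows of (x, y) points, and the threshold is assumed to be a multiplier on each sector's average, which the description above does not specify.

```python
from collections import Counter


def cell_of(x, y, cell_size):
    """Map a map coordinate to an integer grid sector."""
    return (int(x // cell_size), int(y // cell_size))


def baseline_averages(points_by_window, cell_size):
    """Average number of points per sector over the baseline time windows."""
    totals = Counter()
    for points in points_by_window:
        for x, y in points:
            totals[cell_of(x, y, cell_size)] += 1
    n = len(points_by_window)
    return {cell: count / n for cell, count in totals.items()}


def hot_cells(current_points, averages, cell_size, threshold=1.5):
    """Sectors whose current count exceeds threshold times their baseline average.

    Assumption: a sector with no baseline data has average 0, so any
    current point in it is flagged.
    """
    counts = Counter(cell_of(x, y, cell_size) for x, y in current_points)
    return {
        cell
        for cell, count in counts.items()
        if count > threshold * averages.get(cell, 0.0)
    }
```

A larger cell_size corresponds to a lower-resolution grid: each sector pools more baseline points, so its average is less noisy when the data are sparse, at the cost of coarser localization.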
The trend along the interstates was expected; the trend along the river, however, was a new development that needed further analysis. We randomly sampled a few points along the river to identify common symptoms. This quickly yielded results related to 'stomach' pain, 'nausea', 'diarrhea', and the like, whereas other regions seemed to have symptoms closer to 'fever' or 'flu'. To verify this rapidly in a visual sample, we developed a custom filter tool, which can search the raw data set on the fly and highlight matches in a given color. This quickly gave us visual confirmation that there was indeed a separate virus likely making its way down the river. By highlighting flu symptoms in one color and common waterborne symptoms in another, we were able to see that the two were distinct, with very little overlap. This is evident in Fig. 6, where the green dots correspond to blogs matching "nausea" or "stomach" and the red dots to blogs matching "fever".
Fig. 6: May 19, 10.22am-4.22pm. In order to highlight the waterborne contaminant and its spread, specialized filters searching for 'nausea', 'fever' and 'stomach' highlight and distinguish the two types of viral spread (green versus red dots (blogs)). It can be clearly seen that the green dots are flowing along the river.
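The term filters behind Fig. 6 can be approximated with a short sketch. It assumes each blog is an (x, y, text) tuple and uses case-insensitive substring matching; the function names are hypothetical, and the actual tool searches the raw data set and draws through OpenGL rather than returning lists.

```python
# Filters as (terms, color) pairs, matching the colors used in Fig. 6.
FILTERS = [
    (("nausea", "stomach"), "green"),
    (("fever",), "red"),
]


def matching_colors(text, filters=FILTERS):
    """Colors of every filter with at least one term found in the text."""
    lowered = text.lower()
    return [color for terms, color in filters
            if any(term in lowered for term in terms)]


def highlight(blogs, filters=FILTERS):
    """Return (x, y, color) marks for all blogs that match a filter."""
    marks = []
    for x, y, text in blogs:
        for color in matching_colors(text, filters):
            marks.append((x, y, color))
    return marks


def overlap_count(blogs, filters=FILTERS):
    """Number of blogs matched by more than one filter."""
    return sum(1 for _, _, text in blogs
               if len(matching_colors(text, filters)) > 1)
```

Counting blogs that match more than one filter gives a simple numeric check on the "very little overlap" observation. Note that substring matching also catches variants such as "feverish", which may or may not be desired.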
Further confirming the waterborne contaminant was the fact that its progression was consistent with the flow of the river: it spread southward and out of Vastopolis in a visible and predictable fashion. We were able to track these ailments, which had mostly made their way out of the region by the morning of the 20th, leaving just a few lingering traces.
At this point the airborne virus also seems to be in recession, with many of the infected remaining in the hospitals; samples taken from the hospitals would seem to indicate that the virus is some type of flu.